ABSTRACT

Applications that learn from opinionated documents, like
tweets or product reviews, face two challenges. First, the
opinionated documents constitute an evolving stream, where
both the authors' attitude and the vocabulary itself may
change. Second, labels of documents are scarce and labels
of words are unreliable, because the sentiment of a word
depends on the (unknown) context in the author's mind.
Most of the research on mining over opinionated streams
focuses on the first aspect of the problem, whereas for the
second a continuous supply of labels from the stream is
assumed. Such an assumption is unrealistic, though, as the
stream is infinite and the labeling cost is prohibitive. To this
end, we investigate the potential of active stream learning
algorithms that ask for labels on demand. Our proposed
ACOSTREAM approach works with limited labels: it uses
an initial seed of labeled documents, occasionally requests
additional labels for documents from the human expert and
incrementally adapts to the underlying stream while exploiting
the available labeled documents. At its core, ACOSTREAM
consists of an MNB classifier coupled with "sampling"
strategies for requesting class labels for new unlabeled
documents. In the experiments, we evaluate the classifier
performance over time by varying: (a) the class distribution
of the opinionated stream, while assuming that the set of
the words in the vocabulary is fixed but their polarities may
change with the class distribution; and (b) the number of
unknown words arriving at each moment, while the class
polarity may also change. Our results show that active learning
on a stream of opinionated documents delivers good
performance while requiring only a small selection of labels.
Source code is available in R at:
https://www.dropbox.com/s/y2ptl486f4rvohx/acostream_src.zip?dl=0
Datasets are available at:
https://www.dropbox.com/s/gcpcyazp7fqentb/streams_acostream.zip?dl=0
Sentiment Discovery and Opinion Mining (WISDOM 2015), held in
conjunction with KDD'15 in Sydney on 10 August 2015. Copyright of this
work is with the authors.


1. INTRODUCTION 

New communication media make it convenient to share social
content, e.g. opinions, ideas and thoughts, with everyone
connected to the WWW. Blogs, social networks and
microblogging are the common services for posting experiences.
People's contributions to such services, ordered by time
of publication, constitute a stream of opinions.


An opinion is represented by a document that conveys
sentiment; some of its words have a polarity, but these word
polarities do not necessarily determine the polarity of the
document. On the other hand, a word appears in many
opinionated documents, and the polarity of these documents
gives an indication of whether this word is used to describe
positive or negative sentiment. Moreover, polarity learning
on a stream of documents is constrained by the scarcity of
labeled data, since up-to-date labeled reviews or tweets are
not available: it is impractical to expect that a human expert
inspects and labels arriving reviews or tweets on sentiment,
especially in an infinite data stream scenario [16]. In this
study, we investigate how the active acquisition of labels on
document polarity can contribute to learning and adapting
upon an ongoing stream of documents.


According to Mohri [20], the goal of active learning is to
achieve a performance comparable to the standard supervised
learning scenario, but with fewer labeled examples.
Model inference and adaptation over streams lend themselves
to active learning, since the acquisition of fresh labels for all
documents of an ongoing fast stream is impracticable.
However, learning polarity on streams is subject to two
challenges. First, the vocabulary evolves, as new words show
up, and as the positive/negative connotation of some words
changes. Second, the document polarity model evolves, in
the conventional sense of concept drift: the likelihood of
one polarity class becomes higher than before. Most
conventional polarity stream mining algorithms, including
active learning variants, address drift of the document polarity
model but assume that the vocabulary is fixed and known in
advance [15, 3]. In this work, we propose an active stream
learning approach for evolving feature spaces. At the core of
our approach, there is a Multinomial Naive Bayes (MNB)
classifier, which allows for an easy maintenance of class and
word-class statistics over time.


In our earlier work [31, 25], we proposed polarity stream
learning algorithms that adapt to an evolving vocabulary
in the stream. However, in [25] we assume that fresh document
labels are made available at each moment, while in [31]
we assume solely an initial seed of labeled documents and
then adapt the model in a semi-supervised way. In contrast,
in this study we propose an active stream learning algorithm
which requests document labels on demand, based on the need
for adapting the classifier to the underlying stream.


positive and negative words and test the tweet against the 
profile to decide on its class. The profile is maintained online 
over the stream; initially a small set of words is included 
but the seed set is expanded by also including words that 
co-occur often in the stream with words in the seed set. We
also expand on a word basis; however, our approach is
broad rather than topic-specific.


This work is organized as follows. Related work is discussed
in Section 2. The basic concepts of ACOSTREAM, the
incremental update process and the sampling strategies
for document label acquisition are presented in Section 3.
Experimental results are shown in Section 4. Conclusions
and open issues are discussed in the last section.


2. RELATED WORK 

Active learning is a prominent choice when dealing with
problems where labeled data are expensive to obtain, e.g.
polarity classification or computational biology applications.
There exist various active learning approaches, covered in
recent surveys such as [8, 21]. They differ in their heuristics
to select instances for which the true label is requested.
Garnett et al. use the most likely or the most pessimistic
posterior $P(c|d)$ made by a current model. In contrast, Krempl
et al. and Ho et al. [10] weight the posteriors by their
likelihood and use hypothesis testing, respectively, to include
the reliability of the posterior when selecting the next instance.
All these approaches follow the same framework: they select
the next instance and relearn the classifier with the new
instance. Relearning is expensive in terms of runtime when
dealing with large streams, as we do. Our approach works
incrementally; thus, it does not require relearning but rather
expands the current model with new instances.


In context-sensitive learning, it is assumed that the label of
a word depends on the context it is used in. Methods that
trace recurring concepts and those that monitor context
change [17, 9] can trace the association of a word to
a label, but only for a limited number of existing contexts,
respectively recurring concepts, and for a fixed vocabulary.
Therefore, we concentrate on learning with an evolving
vocabulary without making assumptions about concept
recurrence or context switching.


Zliobaite et al. propose two sampling strategies which
are flexible towards a growing collection and also consider
concept change. The latter is covered by allowing
the learner to also select samples which are not close to the
decision boundary, i.e. for which the classifier is very certain,
so that the classifier will not miss concept change. Boy et
al. test uncertainty and relevance sampling with different
classifiers. The latter is used to acquire more examples from
a class which is scarce. Their results show that the Multinomial
Naive Bayes (MNB) classifier performs best for both
sampling techniques on polarity classification. We also use
MNB as classifier.


Yerva et al. propose an active stream learning based
classifier for classifying tweets into relevant or irrelevant for
a given company. Their idea is to build a company profile of

Relevance sampling regards the labeling of those examples
which are most likely to be class members [14].


Recently, Kranjc et al. presented an active learning framework
for selecting the most suitable tweets w.r.t. an initially
trained classification model. They use a Support Vector
Machine as classifier and rebuild the model as soon as new
suitable tweets are selected. They select suitable tweets based
on uncertainty and random sampling. Similarly, Smailovic et al.
contribute an active learning approach distinguishing
opinionated (positive and negative) from non-opinionated
(neutral) tweets in financial Twitter data streams. Based on an
SVM classifier, Smailovic et al. determine a query strategy for
active learning, combining advantages from uncertainty and
random sampling.

We skip a discussion on the most recent polarity classification
algorithms such as Socher et al., as the contribution
of our work is towards active learning strategies for polarity
classification rather than pure polarity classification.

3. ACTIVE OPINION STREAM LEARNING 

We observe a growing collection $D$ of documents that
constitute a stream, which we monitor at distinct timepoints
$\ldots, t_i, \ldots$. Documents arrive at each $t_i$. A document
$d \in D$ is represented by the bag-of-words model, i.e.
$d = w_1, w_2, \ldots, w_n$. We further assume an initial labeled
seed set $S$ of documents: for each $d \in S$, an expert has
assigned a polarity label $c \in C$ ($C$ is the set of possible
labels, e.g., positive, negative). We borrow the notion of an
initial seed set from our previous work. As the stream
progresses the concept of words might change, i.e. a word
which is used to express positive polarity might change its
contextual relation so that it is used to express negative
thoughts. Moreover, new, previously unknown words might
appear as people's vocabulary for expressing positive or
negative opinions evolves over time. The mining goal
is to assess the polarity label of incoming documents while
considering concept change and new words in the stream.

3.1 ACOSTREAM Overview 

An overview of our approach is depicted in Algorithm 1.
Briefly, it works as follows: the seed set $S$ is used to initially
train a classifier $A(S)$ upon the true labels of $S$; the
document labels are propagated to their component words; this
way the vocabulary $V$ (line 2) is derived. The vocabulary
consists of the words observed in $S$ and their distribution
over the positive and negative class. Note that these counts are
adequate to approximate the class-conditional word
probabilities and the class probabilities in MNB. We employ the
classifier to predict the label for each arriving new document
$d$ from the stream (line 4). Depending on the active learning
sampling strategy (cf. Section 3.3), we might request the
true label $c$ for $d$ from an expert (line 7). If this is the case,
we update the related word-class counts and class counts in
the model, for all words appearing in $d$ and the true label $c$
of $d$ (lines 8-10). If we encounter some new word, i.e., one not
in the current vocabulary, we expand the vocabulary accordingly
and start monitoring its occurrences in the different
classes (lines 10-12). Moreover, we update the document-class
counts and the seed set by adding the document to $S$
(lines 13-14).


document $d$ while employing $A$ on $d$:

$$\mathrm{class}(d) = \arg\max_{c \in C} \hat{P}(c|d), \qquad \hat{P}(c|d) \propto \hat{P}(c) \prod_{i=1}^{|d|} \hat{P}(w_i|c)$$


Note that the classifier's predictions are always made on
the current (updated) seed set $S$. That is, the classifier
is a lazy learner. Moreover, the seed set always consists of
true-labeled documents, i.e., labeling was done by an expert.
This implies that the classifier is always trained upon
true-labeled (and therefore reliable) instances.


Algorithm 1: ACOSTREAM
Input: initial seed S, stream D

 1  A <- train initial classifier on seed S; predictedLabels <- {}
 2  V <- extract all words from S
 3  while D do
 4      d <- next document from stream
 5      p <- predict label for d by A(S)
 6      if d is sampled w.r.t. p then
 7          c <- request true label for d
            // incrementally update word-class counts
 8          for i = 1 to |d| do
                // for existing words
 9              if w_i in V then N_ic = N_ic + 1
                // for new words
10              else
11                  N_ic = 1
12                  V <- V ∪ {w_i}   // expand vocabulary
13          N_c = N_c + 1            // update class counts
14          S <- S ∪ {d}             // update seed set


We provide more details in the next subsections. 
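The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the released R implementation: the names `CountingMNB`, `acostream`, `should_sample` and `oracle` are hypothetical, documents are assumed to be lists of word strings, and the class prior is additionally smoothed (an assumption made here only to avoid taking the logarithm of zero).

```python
import math
from collections import defaultdict


class CountingMNB:
    """Multinomial Naive Bayes stored as plain counts (hypothetical helper)."""

    def __init__(self, classes=("+", "-")):
        self.classes = classes
        self.class_counts = {c: 0 for c in classes}  # N_c
        # N_ic: one counter per word and class; new words start at zero.
        self.word_counts = defaultdict(lambda: {c: 0 for c in classes})

    def update(self, doc, label):
        """Incremental update (lines 8-14): no retraining, only counting."""
        for w in doc:
            self.word_counts[w][label] += 1  # also expands the vocabulary
        self.class_counts[label] += 1

    def log_posteriors(self, doc):
        """Unnormalised log posterior per class with Laplace correction."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.word_counts)
        scores = {}
        for c in self.classes:
            total_c = sum(counts[c] for counts in self.word_counts.values())
            # Assumption: smoothed prior, so an unseen class never yields log(0).
            score = math.log((self.class_counts[c] + 1) / (n_docs + len(self.classes)))
            for w in doc:
                n_wc = self.word_counts[w][c] if w in self.word_counts else 0
                score += math.log((n_wc + 1) / (total_c + vocab_size))
            scores[c] = score
        return scores


def acostream(seed, stream, should_sample, oracle):
    """Predict every document; update only when the strategy asks for a label."""
    model = CountingMNB()
    for doc, label in seed:  # lines 1-2: initial classifier and vocabulary
        model.update(doc, label)
    predictions, requested = [], 0
    for doc in stream:  # lines 3-4
        scores = model.log_posteriors(doc)
        predicted = max(scores, key=scores.get)  # line 5
        predictions.append(predicted)
        if should_sample(model, doc, predicted):  # line 6
            model.update(doc, oracle(doc))  # lines 7-14
            requested += 1
    return model, predictions, requested
```

Any sampling strategy can be plugged in through `should_sample`; passing a function that always returns `True` corresponds to a fully supervised incremental learner, while one that always returns `False` leaves the seed model untouched.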


3.2 Building and Maintaining a Polarity Classifier Over Time

Based on the initial seed set $S$, we propagate the class labels
of the documents to their component words $w_i \in V$, where
$V$ is the set of words derived from the documents $d \in S$. We
obtain for each word the word-class counts $N_{ic}$ stating the
number of times $w_i$ has occurred in documents with class
label $c$, i.e.

$$N_{ic} = |\{w_i : \exists d \in S,\ w_i \in d \wedge \mathrm{class}(d) = c\}|$$


That is, the class label of a new document $d$ is the one
maximizing the posterior probabilities $\hat{P}(c|d)$, $c \in C$, which
depend on the class-conditional probabilities of the words
in the document, under the assumption that these words
are independent given the class.

The class prior equals the ratio of documents in $S$ labeled
as $c$ to the total number of labeled documents $|S|$, i.e.

$$\hat{P}(c) = \frac{N_c}{|S|} \qquad (1)$$


Analogously, the conditional probability of a word $w_i$ given
a class $c$ is estimated from the documents in $S$ which are
labeled as $c$ and contain the word $w_i$:

$$\hat{P}(w_i|c) = \frac{N_{ic} + 1}{\sum_{j=1}^{|V|} N_{jc} + |V|} \qquad (2)$$

We apply the Laplace corrector, $1/|V|$, to alleviate the zero
frequency problem for words that have not been observed
under a given class.
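As an illustration of Equations (1) and (2), the following Python sketch derives both estimates from raw counts. The function names and the toy seed are hypothetical, and the sketch assumes that $N_{ic}$ counts every occurrence of a word in documents of class $c$.

```python
from collections import Counter


def train_counts(seed):
    """Return (N_c, N_ic) from labeled documents given as (words, label) pairs."""
    class_counts = Counter()
    word_class_counts = Counter()
    for words, label in seed:
        class_counts[label] += 1
        for w in words:
            word_class_counts[(w, label)] += 1
    return class_counts, word_class_counts


def class_prior(class_counts, c):
    """Equation (1): share of seed documents labeled c."""
    return class_counts[c] / sum(class_counts.values())


def word_given_class(word_class_counts, vocabulary, w, c):
    """Equation (2): Laplace-corrected estimate of P(w | c)."""
    total_c = sum(word_class_counts[(v, c)] for v in vocabulary)
    return (word_class_counts[(w, c)] + 1) / (total_c + len(vocabulary))


# Hypothetical toy seed: one positive and two negative documents.
seed = [(["good", "good", "cheap"], "+"), (["bad"], "-"), (["good", "bad"], "-")]
n_c, n_ic = train_counts(seed)
vocabulary = {w for words, _ in seed for w in words}
```

With the correction, a word such as "bad" that never occurs in a positive seed document still receives a small non-zero probability under the positive class, and the estimates over the vocabulary sum to one for each class.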


3.3 Actively Selecting Documents to Acquire 
New Labels 

As the stream of documents undergoes changes w.r.t. the
empirical word-class distributions $\hat{P}(w_i|c)$, the empirical class
distributions $\hat{P}(c)$ and newly appearing words, the initial
classifier $A(S)$ trained upon the initial seed set $S$ might become
outdated over time. The solution is to update the classifier
in order to respond to these changes. To this end, we
incorporate new documents into the seed set $S$ and accommodate
new words in the vocabulary. We further incrementally
update the word-class counts $N_{ic}$ and the document-class counts $N_c$.
However, we only extend $S$ by documents which are actively
sampled, i.e. for which we requested a true label from an
expert. There are different techniques for actively sampling
labels for new documents; we instantiate our approach with
two alternative strategies, one based on information gain and
another based on uncertainty, discussed in the following
sections. Our approach, though, can be coupled with different
sampling approaches for labeled document acquisition.


We further derive the document class counts $N_c$ expressing
the number of documents with label $c$, i.e.

$$N_c = |\{d : \mathrm{class}(d) = c\}|$$

Upon the class and word-class counts we compute the
empirical class distributions $\hat{P}(c)$ w.r.t. class $c$ and the empirical
word-class distributions $\hat{P}(w_i|c)$ for each word $w_i \in V$, as
described in the following section. We use a "hat" as in $\hat{P}$
to denote empirical estimates hereafter.


3.3.1 Sampling by Information Gain 
We select a new document $d$ for the extension of $S$ that
shows a gain in information with respect to the thus far
observed word-class distribution of words $w_i \in d$ and the
distribution after considering the predicted label for $d$. The
usage of the information gain is motivated by the attribute
selection measures used in decision trees and our previous
work [30]. It is defined as follows:


Based on the empirical distributions, we build a Multinomial
Naive Bayes classifier $A$. It is very fast for induction and robust
to irrelevant attributes, while providing good prediction
performance. We assess the polarity label of an arriving


Definition 1. [Information Gain] Let $d$ be a new document
containing words $w_i \in d$, for which the current classifier
$A$ predicts, for instance, the positive polarity label $+$.
The Information Gain of $d$ w.r.t. the predicted label relies
upon the difference in entropy before and after the addition
of the new label $+$.

$$IG(d) = \sum_{w_i \in d \,\wedge\, w_i \in V} \left[ H(N_{i+}, N_{i-}) - H(N_{i+} + 1, N_{i-}) \right] \qquad (3)$$

Here, $H(N_{i+}, N_{i-})$ is the entropy of $w_i$ regarding the two
polarity classes $+$ and $-$, which expresses the purity of the
class distribution based solely on $w_i$. The second term,
$H(N_{i+} + 1, N_{i-})$, is the entropy of $w_i$ when considering $d$
and its predicted label, $+$ in this example, as part of the
seed set.


The entropy of two positive values $a, b \in \mathbb{N}$ is:

$$H(a, b) = -\frac{a}{a+b} \log_2 \frac{a}{a+b} - \frac{b}{a+b} \log_2 \frac{b}{a+b}$$

words. 

3.3.2 Sampling by Uncertainty 

As a second sampling strategy for acquiring true document
labels, we utilize uncertainty. The idea of uncertainty
sampling is to ask the expert to label an instance for which
the current classifier is less certain, i.e. for which the
certainty is below some fixed threshold $\alpha$. Since uncertain
examples are close to the classifier's decision border,
accommodating them makes the predictions of a classifier more
distinctive. Given our MNB classifier, we use the posterior
probability estimates $\hat{P}(+|d)$ and $\hat{P}(-|d)$ computed
by MNB as measure for certainty. A low posterior probability
means that the classifier is less certain. The uncertainty
is then defined as:


Documents that increase the information reflect the current
classifier very well and also enhance the classifier's
performance while following the thus far observed word-class
distributions, i.e. the distributions become more pure and thus
the predictions are less random. A document that shows a
gain in the information w.r.t. a predicted label $c$ is sampled,
i.e. the true label provided by an expert is requested and
then utilized to update the classifier.

We update the classifier based on the received true label.
Considering the predicted and the true label for $d$, there are
two possible scenarios: (i) the predicted label matches the
true label, i.e. the classifier is reinforced in its decision when
being updated with the true label, and (ii) the predicted label
is different from the true label. The latter case occurs if
the classifier does not reflect the current concept underlying
the stream, i.e. it makes a wrong prediction. The current
concept is assumed to be reflected by the true label of the
document. Therefore, $A$ must be updated with the true
label so that the concept of the related word-class distributions
can be changed according to the underlying population
of the stream. Hence, we do not miss concept change since
we update with the true label.

In case of changes in the word-class distribution, the
information gain relies on frequent and old words, i.e. words
which have appeared in many documents over time, rather
than on words that just newly appeared. This is to be
preferred, as frequent and old words carry more evidence
regarding the class. A toy example helps to illustrate this:
assume a word $w$ that has occurred thus far in 30 positive
documents; a new document $d$ appears bearing $w$
and the classifier predicts the negative label for $d$, so there
is a change in $w$. The entropy difference for $w$ would be
$(1/31) \cdot \log_2(1/31)$; this is a small value, so it is likely
that there is still a gain in information if the class
distributions of the other words in $d$ are promoted by the negative
label, and thus $d$ is selected to update the classifier. It is
easy to see that the entropy difference regarding $w$ is higher
if $w$ has appeared fewer than 30 times before the change
occurs. Hence, we trust frequent and old words more when a
change occurs.
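The entropy and the information gain of Definition 1 can be sketched in Python as follows. The names are hypothetical, the per-word counts are assumed to be stored as `(N_i+, N_i-)` pairs, the sum runs over the distinct known words of the document, and $0 \cdot \log_2 0$ is taken as 0. The sketch evaluates the complete entropy difference, so for the toy example its value need not coincide with the single term quoted above.

```python
import math


def entropy(a, b):
    """Binary entropy H(a, b) of two non-negative counts (0 * log 0 := 0)."""
    total = a + b
    if total == 0:
        return 0.0
    h = 0.0
    for n in (a, b):
        if n > 0:
            p = n / total
            h -= p * math.log2(p)
    return h


def information_gain(doc, counts, predicted):
    """Sum of entropy differences over the known words of doc (Equation 3).

    counts maps a word to its (positive, negative) counts; predicted is the
    label ("+" or "-") that the current classifier assigns to doc.
    """
    gain = 0.0
    for w in set(doc):
        if w not in counts:  # words outside V are skipped
            continue
        pos, neg = counts[w]
        before = entropy(pos, neg)
        if predicted == "+":
            after = entropy(pos + 1, neg)
        else:
            after = entropy(pos, neg + 1)
        gain += before - after
    return gain


def sample_by_information_gain(doc, counts, predicted):
    """Request the true label only if the predicted label yields a gain."""
    return information_gain(doc, counts, predicted) > 0
```

Calling `information_gain(["w"], {"w": (30, 0)}, "-")` and the same call with counts `(5, 0)` shows the effect described in the toy example: the loss caused by a contradicting label is smaller in magnitude for the word with the longer history.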

It is noted that when computing the information gain we
consider only the entropy difference over words $w_i \in d$,
instead of all words $w \in S$, i.e. we do not iterate over all the

Definition 2. [Uncertainty] Let $d$ be a new document and
$A(S)$ be the current classifier that computes the posterior
probabilities of the two classes $(+, -)$. The predictions of
$A(S)$ are considered as uncertain if:

$$\max_{c \in \{+,-\}} \hat{P}(c|d) < \alpha$$

where $\alpha$ is a value in $(0,1)$.


The parameter $\alpha$ is selected manually: small values ensure
that only few documents are sampled, so the classifier is
updated only with documents very close to the decision
boundary. That is, if the threshold is selected too small, the
classifier will miss changes. In contrast, bigger values cause
more examples to be sampled. They also allow sampling of
documents that are far from the border and which might
bear concept change. This also implies more label requests,
though.
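The uncertainty test of Definition 2 can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the classifier exposes unnormalised log posteriors per class, which are normalised here before being compared against the threshold.

```python
import math


def normalise(log_scores):
    """Turn unnormalised log posteriors into probabilities that sum to one."""
    top = max(log_scores.values())
    # Subtracting the maximum keeps exp() numerically stable for long documents.
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


def is_uncertain(log_scores, alpha):
    """True if the most probable class has a posterior below alpha."""
    posteriors = normalise(log_scores)
    return max(posteriors.values()) < alpha
```

For two classes the larger posterior is never below 0.5, so under this reading any threshold of 0.5 or less requests no labels at all, while thresholds close to 1 request a label for almost every document.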


4. EXPERIMENTS 

To evaluate ACOSTREAM, we experiment with two real-world
datasets of opinionated documents (product reviews
and tweets). The original streams were modified in order to
test the performance of ACOSTREAM in extreme and less
extreme cases. A detailed description of the datasets is given
in Section 4.1. We compared ACOSTREAM against
several baselines presented in Section 4.2. The results of our
experiments are presented in Section 4.3.

4.1 Datasets 

Stream StreamJi comes from a dataset first introduced by Yu
et al., which contains data crawled from cnet.com,
viewpoints.com, reevoo.com and gsmarena.com. The true labels
of the reviews were derived by the authors from star ratings.
The reviews cover mostly products and their properties such
as "phone", "firmware" and "price". We use only reviews
describing single product features, after removing very short
reviews containing fewer than 2 adjectives. More details on
the dataset and our preprocessing are also provided in our
previous work. The StreamJi dataset contains 11.374
product reviews and a vocabulary of 3.048 different words.

Stream TwitterSentiment was collected by querying the
(non-streaming) Twitter API for messages between April 2009
and June 25, 2009. The stream
is very heterogeneous regarding the content. The true labels
(ground truth) of the tweets were acquired through the
Maximum Entropy classifier using emoticons as class labels.
The stream also depicts a very strong concept shift towards
its end, as only one of the two classes, the negative one,
is observed at the end of the stream. The original stream
contains 1.600.000 tweets; we focus on the last part of the
stream, tweets 1.235.000 - 1.485.000, reflecting concept drift.
The selected dataset consists of 250.000 tweets with a
vocabulary of 169.853 different words.


In StreamJi we focused only on adjectives and adverbs for
sentiment analysis since, according to [27], these words
bear the actual opinion of the author; similar observations
have been reported in related work. Stream TwitterSentiment
comes with nouns and verbs.


4.1.1 The effect of new appearing words 
In our experiments we show how ACOSTREAM performs
with a continuously expanding vocabulary $V$, i.e., when new
words arrive over time from the stream. To this end, we
re-order the original streams so that the number of appearances
of words from the initial seed set $S$ decreases over
time whereas the number of new words increases. That
is, vocabulary-wise the initial seed set becomes "outdated"
w.r.t. the evolving stream. The ordering was done as in our
previous work [30].

Based on the ordering procedure, we obtain for each original
stream a re-ordered counterpart which begins with documents
that contain only words from the initial vocabulary
$V$ extracted from $S$; as the stream progresses, the number of
new words increases while documents arrive that also contain
words $w \notin V$. In Figure 1 we draw the percentage of
known and new words per document over time for the
re-ordered versions of the streams, averaged over batches of size
42 resp. 5000. In the very beginning all words are known;
over time, though, the ratio of known words decreases, with
unknown words dominating the stream.



Figure 1: Percentage of known, new and first appearing
words over time (avg. per batch) for the re-ordered
version of stream StreamJi (top), $|S| = 140$,
resp. TwitterSentiment (bottom), $|S| = 5.000$


We divide the words that are unknown w.r.t. the initial seed
set into i) first-time observed new words (in gray) and ii)
already monitored new words (in blue). In the re-ordered
versions in Figure 1, the number of new words is increasing
over time and after some point the stream bears merely new
words, whereas the number of first-time observed words is
rather static over time, showing a continuously increasing
variety of words. The reason for re-ordering is to show how
the classifier deals with an expanding vocabulary.

The class distribution of the streams is depicted in Figure 2.
StreamJi is slightly skewed towards the negative class over
the whole stream, while TwitterSentiment is uniformly
distributed at the beginning whereas, as the stream progresses,
the distribution moves more towards the positive class.


Figure 2: Class distribution on StreamJi (top) and
TwitterSentiment (bottom) accumulated over batches of size
100 resp. 5.000

The obtained re-orderings of the original streams also bear
changes in the polarity of words, i.e. the word-class
distributions change over time. Figure 3 depicts the word
distribution of the words "best" on StreamJi and "tomorrow" on
TwitterSentiment as accumulated ratio of documents with
positive (green) resp. negative (red) label: the distributions
of both words change over time, e.g. for the word "tomorrow",
the ratio of negative documents alternates heavily; for
instance, at document 13.800 only negative documents are
shown, followed by a majority of positive documents.

4.1.2 Fixed Vocabulary 

The scenario where new words appear over time is an
extreme one, though it is a rather realistic one in polarity
learning over streams. To apply our approach to a less
extreme scenario, we run experiments on streams showing
no new words over time, i.e. the seed contains all words of
the stream.
Figure 3: Word-class distribution of the words "best" on
StreamJi (top) and "tomorrow" on TwitterSentiment (bottom)
accumulated over batches of size 50 resp. 1.000 and
depicted as frequency in percentage

Figure 4: Class distribution of streams StreamJi and
TwitterSentiment showing no new words over time w.r.t. seeds
with sizes of 1.000 resp. 10.000 documents


Therefore, we reduced the original streams, keeping the
original order though, to documents that contain only words
which are part of the initial vocabulary $V$ extracted from
the seed. We acquired the shortened stream by selecting
a relatively large seed $S$ (1.000 for StreamJi and 10.000
for TwitterSentiment). Based on $S$, we extracted the
vocabulary $V$, and as the stream progresses we considered the
documents $d$ that contain only words $w \in V$.
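The construction of such a shortened stream can be sketched in Python as follows. The function name is hypothetical, and documents are assumed to be lists of word strings given in their original stream order.

```python
def restrict_to_seed_vocabulary(documents, seed_size):
    """Split off the seed and keep only later documents fully covered by V."""
    seed = documents[:seed_size]
    vocabulary = {w for doc in seed for w in doc}
    # The original order is preserved; documents with any unseen word are dropped.
    shortened = [
        doc for doc in documents[seed_size:]
        if all(w in vocabulary for w in doc)
    ]
    return seed, vocabulary, shortened
```

A larger seed yields a larger vocabulary and therefore retains more of the stream, which is in line with the relatively large seed sizes chosen above.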

The class distribution of the resulting streams is depicted
in Figure 4. We aggregated the number of positive and
negative documents over batches of size 100 and 1000 for
StreamJi resp. TwitterSentiment. The resulting versions of the
streams are smaller than the original versions: StreamJi
contains 7.018 documents and 759 words, while
TwitterSentiment covers 81.480 tweets and 14.785 words.

Similar to the re-ordered versions of the streams, described
in Section 4.1.1, the shortened stream bears concept
change of the words. We skip detailed figures on specific
words, though, as they mostly conform with the word
distributions depicted in Figure 3.

4.2 Learning methods and quality measures 

Below we outline the approaches we used to compare against 
ACOSTREAM. They all use Naive Bayes as classifier but 
differ on which documents they use for adaptation. 


• IncrementalMNB: The classifier is updated gradually
with each incoming instance based on the true
labels of the instances. It assumes 100% availability of
true labels. This approach serves as an upper baseline.

• StaticMNB: The classifier is not updated over time;
rather, it is trained once upon the initial seed set and
remains static over the whole stream. This approach
serves as a lower baseline.

• Random: The random sampling strategy labels the
incoming instances at random instead of deciding
actively on the relevance of the label. For every incoming
instance the true label is requested with a probability
B, where B is the budget. We vary the budget
in our experiments between 0,3 and 0,6; e.g., with a
budget of 0,3, about 30% of the documents from the
stream are asked for the true label.
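The Random baseline in the list above can be sketched in Python as follows. The function name and the optional seed argument are hypothetical; the sketch only illustrates that each incoming document is sampled independently with probability B.

```python
import random


def make_random_sampler(budget, seed=None):
    """Return a function that requests a label with probability `budget`."""
    rng = random.Random(seed)  # private generator, so runs are reproducible

    def should_sample():
        # rng.random() is uniform in [0, 1), hence True with probability budget.
        return rng.random() < budget

    return should_sample
```

Over a long stream the share of requested labels approaches B, so a budget of 0,3 corresponds to roughly 30% of the documents, while the exact number of requests varies from run to run.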


To evaluate the quality of our classifiers, we use the kappa
statistic, which normalizes the classifier's accuracy by the
accuracy of a chance classifier: $\kappa = \frac{p_0 - p_c}{1 - p_c}$.

Here, $p_0$ is the accuracy of a classifier and $p_c$ is the probability
of making a correct prediction by a chance classifier that
assigns the same number of examples to each class as the
classifier under consideration. The kappa varies between -1 and
1: a value $\leq 0$ indicates that the classifier's predictions
coincide with, or are worse than, the predictions of the chance
classifier. A value $> 0$ implies that the classifier's predictions
are better than those of a chance classifier. The higher the
value, the more often the predictions match the true
labels. Kappa is preferred to accuracy for data streams as
it can handle imbalanced class distributions.
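The kappa statistic can be sketched in Python as follows. The function name is hypothetical; the chance accuracy $p_c$ is computed from the class frequencies of the true and the predicted labels, and returning 0 when $p_c = 1$ is a convention assumed here for the degenerate case in which only one class occurs.

```python
from collections import Counter


def kappa(true_labels, predicted_labels):
    """Accuracy normalised by the accuracy of a chance classifier."""
    n = len(true_labels)
    p0 = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_freq = Counter(true_labels)
    pred_freq = Counter(predicted_labels)
    # Chance agreement: a classifier that keeps the predicted class shares
    # but assigns them independently of the true labels.
    pc = sum(true_freq[c] * pred_freq[c] for c in pred_freq) / (n * n)
    if pc == 1.0:
        return 0.0  # assumed convention: no better than chance
    return (p0 - pc) / (1 - pc)
```

On a batch that contains a single class and is predicted perfectly, accuracy is 1 while the sketch returns 0, which illustrates why kappa is the more informative measure on skewed stream segments.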































































4.3 Performance evaluation 

In this section, we compare ACOSTREAM using the
information gain and uncertainty sampling strategies against
IncrementalMNB, StaticMNB and random
sampling, based on the kappa over time. As
we deal with an evolving stream of documents, a fixed budget
of true labels, which is normally applied when comparing
different sampling strategies [32], cannot be utilized.
It would lead to an unfair comparison, as
the budget would be spent differently among the strategies.
Instead, we used different values for the uncertainty threshold
$\alpha$ and for random sampling across our experiments, yielding
different numbers of requested labels over the stream. We
report the number of requested labels as a percentage of the
stream length in a separate table. IncrementalMNB
always asks for 100% of the labels, while StaticMNB uses
only the true labels of the training set $S$. We implemented
two experiments: i) we kept the vocabulary fixed over the
stream while considering documents that contain only words
$w \in V$, cf. Section 4.1.2, and ii) we allowed the set of words $V$
to evolve by including newly appearing words.


In the following we examine the performance of ACOSTREAM
in the two experiments, comparing against the baselines
described in Section 4.2.


4.3.1 Results on the Fixed Vocabulary Stream 

We report on the results of our experiments
upon streams with a fixed vocabulary and evolving
word-class counts, cf. Section 4.1.2, using kappa
as evaluation measure. Figure 5 depicts the kappa over time
for ACOSTREAM using information gain (Acostream_ig)
and uncertainty (Acostream_u) sampling, IncrementalMNB,
StaticMNB and Random sampling on the shortened streams
StreamJi (upper picture) and TwitterSentiment (lower picture).

ACOSTREAM shows good performance on both streams when 
applying information gain sampling. On stream StreamJi, it 
reaches a kappa that is rather close to the kappa of the 
upper baseline while utilizing only 44% of the true labels 
(cf. Table 1); on TwitterSentiment it outperforms 
IncrementalMNB for most of the stream using only 40% of the 
labels. Hence, information gain sampling performs very well, 
requesting only 44% resp. 40% of the labels to achieve a 
comparable or higher kappa than IncrementalMNB, which 
samples 100% of the labels. In contrast, uncertainty 
sampling, which uses 40% resp. 47% of the labels on StreamJi 
resp. TwitterSentiment, shows a lower kappa, similar to the 
results obtained by random sampling. Kappa drops to 0 at the 
end of TwitterSentiment because only negative documents 
arrive there; consequently, one cannot be better than a 
chance classifier. 
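
The drop of kappa to 0 on a segment that contains only negative documents can be verified with a short derivation. It assumes that kappa is computed as (p_0 - p_c) / (1 - p_c), with the chance accuracy p_c obtained from the class proportions of the true labels and of the predictions; the symbol q is introduced here for illustration only.

```latex
% Assumption: kappa = (p_0 - p_c) / (1 - p_c), with p_c computed from the
% class proportions of the true labels and of the predictions.
% q = fraction of documents in the segment that are predicted as negative;
% all true labels in the segment are negative.
\begin{align*}
p_0 &= q && \text{(only the negative predictions are correct)} \\
p_c &= q \cdot 1 + (1 - q) \cdot 0 = q \\
\kappa &= \frac{p_0 - p_c}{1 - p_c} = \frac{q - q}{1 - q} = 0 \qquad (q < 1)
\end{align*}
```

For q = 1 the ratio is undefined (0/0). In every other case kappa is exactly 0, whatever the classifier predicts on such a segment.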

4.3.2 Results on continuously expanding vocabulary 
We examine how the performance of ACOSTREAM is affected by 
a continuously expanding vocabulary and evolving word-class 
distributions. On stream StreamJi, information gain sampling 
performs very well: it shows the highest and most robust 
kappa over time among all approaches to which we compare, 
using only 60% of the labels (top picture of Figure 6). 
Uncertainty sampling does not perform well on stream 
StreamJi, showing a lower 



Figure 5: Kappa for the three methods to which 
we compare and ACOSTREAM on stream StreamJi (top) 
and TwitterSentiment (bottom) with a fixed vocabulary 
(x-axis of both panels: documents over time) 










Experiment + Dataset            ACOSTREAM (IG)  ACOSTREAM (U)  IncrementalMNB  StaticMNB  Random 
fixed V: StreamJi               44              40             100             1          40 
fixed V: TwitterSentiment       40              47             100             1          42 
evolving V: StreamJi            60              59             100             1          60 
evolving V: TwitterSentiment    52              88             100             2          31 

Table 1: Requested labels per method and experiment, as a percentage of the length of 
the stream including the documents used to train the classifier, i.e., the size of the seed. 
(IG = Information Gain, U = Uncertainty) 


kappa than random sampling. On TwitterSentiment it performs 
well, but it requires 88% of the labels to be competitive 
with ACOSTREAM using information gain sampling, which 
acquires only 52% of the labels. The results on stream 
TwitterSentiment, depicted in the bottom picture of Figure 
6, reveal that IncrementalMNB performs best on large streams 
with many words (169,853). ACOSTREAM (both sampling 
strategies) follows, showing a similar pattern of the kappa 
curve but with slightly lower values. 



[Plot legend: Acostream_ig, IncrementalMNB, StaticMNB; x-axis ticks: 50000, 100000, 150000, 200000, 250000] 


Figure 6: Kappa for the three methods to which 
we compare and ACOSTREAM on stream StreamJi (top) 
and TwitterSentiment (bottom) under an evolving 
vocabulary 

ACOSTREAM is not negatively affected by new words and 
exhibits stable performance across both streams. Moreover, 
its curves show a pattern similar to that of IncrementalMNB, 
which adapts with all documents of the stream and thus 
considers all changes of the word distributions. That is, 
ACOSTREAM, in particular when information gain is used, 
adapts well to the underlying change in the population of 
the stream. 

4.3.3 Effect of the uncertainty threshold α 
To show the effect of the uncertainty threshold α, cf. 
Section 3.3.2, we varied the value of α on the streams 
StreamJi and TwitterSentiment with a vocabulary of fixed 
size. Figure 7 depicts kappa over time on StreamJi (upper 
picture) and TwitterSentiment (lower picture) for different 
settings of α when uncertainty is used as the sampling 
strategy of ACOSTREAM. We varied α among five values: e(-2), 
e(-10), e(-20), e(-30) and e(-40), where e() is the 
exponential function. Note that the posteriors become rather 
small because we deal with a sparse feature space. Thus, we 
had to set small values for α in order to cause a difference 
in the consumption of labels. 

The results on both streams show an increasing performance 
for larger values of α. This is not surprising: with 
increasing α, the number of considered samples also grows, 
which intuitively leads to a better performance. The gap in 
performance among the values of α is huge on stream 
StreamJi, where 100%, 87%, 33%, 19% and 15% of the labels 
are requested, while on TwitterSentiment 100%, 92%, 74%, 55% 
and 40% of the documents are sampled, leading to smaller 
gaps between the curves. 

5. CONCLUSION 

Polarity learning on an evolving stream is a challenging 
task, as the stream is subject to concept changes: existing 
words might change sentiment over time, e.g., due to a 
different context, and new words might appear to express 
opinions. Another challenge for a stream polarity learner is 
the scarcity of class labels, since manual labeling of the 
(infinite) stream is unrealistic. Responding to these 
challenges requires adapting the model to the underlying 
stream population based on only a few labeled examples. 

In this work, we proposed our active stream learning 
framework ACOSTREAM for incrementally updating a polarity 
learner based on actively acquired document labels. We 
instantiated our framework with two sampling strategies, 
information gain and uncertainty. We compared our method to 
a traditional active learning approach (random sampling), an 
incremental approach that requires the labels of all 
arriving documents, and a non-adaptive method. Our results 
show that actively asking for labels pays off, as the 
performance of the classifier is quite good while the label 
consumption remains low. Comparing the two sampling 
approaches, information gain-based sampling shows good 
performance on all datasets w.r.t. the number of required 
labels, the accuracy of predictions and the adaptation to 
concept change. Uncertainty-based sampling, on the contrary, 
shows poor performance. 

Our ongoing work involves more elaborate techniques for 
propagating document labels to words, considering that not 
all words contribute equally to the polarity of a document. 
Furthermore, we want to relax the assumption that new 
documents are independent of each other when deciding 
whether to sample them. This will allow us to sample with a 
wider perspective, detecting change early and addressing 
emerging scenarios comprehensively. Moreover, we plan to 
instantiate ACOSTREAM with different classifiers (other than 
the currently employed MNB) and different sampling 
strategies for active learning. 
Figure 7: Kappa for different settings of α when 
using uncertainty as sampling strategy for ACOSTREAM 
on stream StreamJi (top) and TwitterSentiment (bottom) 